Abstract



For large scale learning problems, it is desirable if we can obtain the optimal model parameters by
going through the data in only one pass. Polyak and Juditsky (1992) showed that asymptotically
the test performance of the simple average of the parameters obtained by stochastic gradient
descent (SGD) is as good as that of the parameters which minimize the empirical cost. However, to
our knowledge, despite its optimal asymptotic convergence rate, averaged SGD (ASGD) has received
little attention in recent research on large scale learning. One possible reason is that it may
take a prohibitively large number of training samples for ASGD to reach its asymptotic region
for most real problems. In this paper, we present a finite sample analysis for the method of
Polyak and Juditsky (1992). Our analysis shows that it indeed usually takes a huge number of
samples for ASGD to reach its asymptotic region if the learning rate is improperly chosen. More
importantly, based on our analysis, we propose a simple way to properly set the learning rate so that
it takes a reasonable amount of data for ASGD to reach its asymptotic region. We compare ASGD
using our proposed learning rate with other well known algorithms for training large scale linear
classifiers. The experiments clearly show the superiority of ASGD.

Keywords: stochastic gradient descent, large scale learning, support vector machines, stochastic 
optimization 



1. Introduction 



For prediction problems, we want to find a function $f_\theta(x)$ with parameter $\theta$ to predict the value
of the outcome variable $y$ given an observed vector $x$. Typically, the problem is formulated as an
optimization problem:

$$\theta_t^* = \arg\min_\theta \frac{1}{t}\sum_{i=1}^{t} L(f_\theta(x_i), y_i) + R(\theta) \qquad (1)$$

where $t$ is the number of data points, $\theta_t^*$ is the parameter that minimizes the empirical cost, $(x_i, y_i)$
is the $i$-th training example, $L(s, y)$ is a loss function which gives a small value if $s$ is a good prediction
for $y$, and $R(\theta)$ is a regularization function for $\theta$ which typically gives a small value for small $\theta$. Some
commonly used $L$ are: $\max(0, 1 - ys)$ for support vector machine (SVM), $\frac{1}{2}(\max(0, 1 - ys))^2$ for
L2 SVM, and $\frac{1}{2}(y - s)^2$ for linear regression. Some commonly used regularization functions are: L2
regularization $\frac{\lambda}{2}\|\theta\|^2$, and L1 regularization $\lambda\|\theta\|_1$.

For large scale machine learning problems, we need to deal with optimization problems with
millions or even billions of training samples. Classical optimization techniques such as interior
point methods or conjugate gradient descent have to go through all data points just to evaluate the
objective once, and they need to go through the whole data set many times in order to
find the best $\theta$.

On the other hand, stochastic gradient descent (SGD) has been shown to have great promise for
large scale learning (Zhang, 2004; Hazan et al., 2006; Shalev-Shwartz et al., 2007; Bottou and Bousquet,
2008; Shalev-Shwartz and Tewari, 2009; Langford et al., 2009). Let $d = (x, y)$ be one data sample,
$l(\theta, d) = L(f_\theta(x), y) + R(\theta)$ be the cost of $\theta$ for $d$, $g(\theta, d) = \frac{\partial l(\theta, d)}{\partial \theta}$ be the gradient function, and
$D_t = (d_1, \cdots, d_t)$ be all the training samples at the $t$-th step. The SGD method updates $\theta$ according to
its stochastic gradient:

$$\theta_t = \theta_{t-1} - \gamma_t\, g(\theta_{t-1}, d_t) \qquad (2)$$

where $\gamma_t$ is the learning rate at the $t$-th step. $\gamma_t$ can be either a scalar or a matrix. Let the expected
loss of $\theta$ over test data be $\mathcal{L}(\theta) = E_d(l(\theta, d))$, the optimal parameter be $\theta^* = \arg\min_\theta \mathcal{L}(\theta)$, and the
Hessian be $H = \left.\frac{\partial^2 \mathcal{L}(\theta)}{\partial\theta\,\partial\theta^T}\right|_{\theta=\theta^*}$. Note that $\theta_t$ and $\theta_t^*$ are random variables depending on $D_t$. Hence
both $\mathcal{L}(\theta_t)$ and $\mathcal{L}(\theta_t^*)$ are random variables depending on $D_t$. If $\gamma_t$ is a scalar, the best asymptotic
convergence for the expected excess loss $E_{D_t}(\mathcal{L}(\theta_t)) - \mathcal{L}(\theta^*)$ is $O(t^{-1})$, which is obtained by using
$\gamma_t = \gamma_0(1 + \gamma_0\lambda_0 t)^{-1}$, where $\lambda_0$ is the smallest eigenvalue of $H$ and $\gamma_0$ is some constant. The
asymptotic convergence rate of SGD can potentially benefit from using second order information
(Bottou and Bousquet, 2008; Schraudolph et al., 2007; Amari et al., 2000). The optimal asymptotic
convergence rate is achieved by using the matrix valued learning rate $\gamma_t = \frac{1}{t}H^{-1}$. If this optimal matrix
step size is used, then asymptotically second order SGD is as good as explicitly optimizing the
empirical loss. More precisely, this means that both $t\,E_{D_t}(\mathcal{L}(\theta_t) - \mathcal{L}(\theta^*))$ and $t\,E_{D_t}(\mathcal{L}(\theta_t^*) - \mathcal{L}(\theta^*))$
converge to the same positive constant.



Since $H$ is unknown in advance, methods for adaptively estimating $H$ have been proposed (Bottou and LeCun,
2005; Amari et al., 2000). However, for high dimensional data sets, maintaining a full matrix $H$ is
too computationally expensive. Hence various methods for approximating $H$ have been proposed
(LeCun et al., 1998; Schraudolph et al., 2007; Roux et al., 2008; Bordes et al., 2009). However, with
the approximated $H$, the optimal convergence cannot be guaranteed. It is worth pointing out that
most of the existing analyses for second order SGD are asymptotic, namely, they do not tell how
much data is needed for the algorithm to reach its asymptotic region.

In order to accelerate the convergence speed of SGD, averaged stochastic gradient (ASGD) was
proposed in Polyak and Juditsky (1992). For ASGD, the running average $\bar\theta_t = \frac{1}{t}\sum_{j=1}^{t}\theta_j$ of the
parameters obtained by SGD is used as the estimator for $\theta^*$. Polyak and Juditsky (1992) showed
that $\bar\theta_t$ converges to $\theta^*$ as well as full second order SGD does, which means that if there
are enough training samples, ASGD can obtain a parameter as good as the empirical optimal
parameter $\theta_t^*$ in just one pass of data. Another advantage of ASGD is that, unlike second
order SGD, ASGD is extremely easy to implement. Zhang (2004) and Nemirovski et al. (2009) gave
non-asymptotic analyses for ASGD with a fixed learning rate. However, the convergence
bounds obtained by Zhang (2004) and Nemirovski et al. (2009) are far less appealing than that of
Polyak and Juditsky (1992).

Despite its nice properties, ASGD has received little attention in recent research on online large scale
learning. The reason for the lack of interest in ASGD might be that its potentially good convergence
has not been realized by researchers in real applications. Our analysis shows that the cause of this may
be that ASGD needs a prohibitively large amount of data to reach its asymptotic region if the learning
rate is chosen arbitrarily.

A typical choice for the learning rate $\gamma_t$ is to make it decrease as fast as $\Theta(t^{-c})$ for some constant
$c$. In this paper, we assume a particular form of learning rate schedule which satisfies this condition:

$$\gamma_t = \gamma_0(1 + a\gamma_0 t)^{-c} \qquad (3)$$

where $\gamma_0$, $a$ and $c$ are some constants. Based on this form of learning rate schedule, we provide a
non-asymptotic analysis of ASGD. Our analysis shows that $\gamma_0$ and $a$ should be properly set according
to the curvature of the expected cost function, while $c$ should be a problem independent constant. With
our recipe for setting the learning rate, we show that ASGD outperforms SGD if the data size is
large enough for SGD to reach its asymptotic region.
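As a small illustration of the schedule (3), the following Python sketch evaluates $\gamma_t$ for a few step counts. The function name `learning_rate` and the numeric values of $\gamma_0$ and $a$ are assumptions chosen only for illustration.

```python
def learning_rate(t, gamma0, a, c):
    """Schedule (3): gamma_t = gamma0 * (1 + a * gamma0 * t) ** (-c)."""
    return gamma0 * (1.0 + a * gamma0 * t) ** (-c)


# Hypothetical constants, for illustration only.
gamma0, a = 1.0, 0.02
for c in (2.0 / 3.0, 1.0):
    rates = [learning_rate(t, gamma0, a, c) for t in (0, 10, 100, 1000)]
    print("c =", c, "rates:", rates)
```

With $c < 1$ the rate decays more slowly than the $c = 1$ schedule used for plain SGD, which is the regime the analysis of averaging relies on.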

To demonstrate the effectiveness of ASGD with the proposed learning rate schedule, we apply 
ASGD for training linear classification and regression models. We compare ASGD with other promi- 
nent large scale SVM solvers on several benchmark tasks. Our experimental results show the clear 
advantage of ASGD. 

In the rest of the paper, for matrices $X$ and $Y$, $X \le Y$ means $Y - X$ is positive semi-definite, and
$\|x\|_A$ is defined as $\sqrt{x^T A x}$. We will assume $\gamma_t = \gamma_0(1 + a\gamma_0 t)^{-c}$ for some constants $\gamma_0 > 0$, $a \ge 0$
and $0 \le c \le 1$ in all the theorems and lemmas. Throughout this paper we denote $\Delta_t = \theta_t - \theta^*$ and
$\bar\Delta_t = \bar\theta_t - \theta^*$. To help the reader focus on the main idea, we put most proofs in the Appendix.

The paper is organized as follows: Section 2 establishes some results on the stochastic linear
equation; Section 3 extends the results to ASGD for quadratic loss functions; Section 4 works on general
non-quadratic loss functions; Section 5 discusses some implementation issues; Section 6 shows
experimental results; Section 7 concludes the paper; and the Appendix includes all the proofs.



2. Stochastic Linear Equation 

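To motivate the problem, we first take a close look at the SGD update (2). Let $\bar g(\theta) = E(g(\theta, d))$
and the first order Taylor expansion of $\bar g(\theta)$ around $\theta^*$ be $A\theta - b$, where $A = \left.\frac{\partial \bar g(\theta)}{\partial\theta}\right|_{\theta=\theta^*}$ and
$b = A\theta^* - \bar g(\theta^*) = A\theta^*$. Then $g(\theta_{t-1}, d_t)$ can be decomposed as:

$$g(\theta_{t-1}, d_t) = (A\theta_{t-1} - b) + g(\theta^*, d_t) + \big(g(\theta_{t-1}, d_t) - g(\theta^*, d_t) - \bar g(\theta_{t-1})\big) + \big(\bar g(\theta_{t-1}) - A\theta_{t-1} + b\big)$$

We write $\xi_t^{(1)} = g(\theta^*, d_t)$, $\xi_t^{(2)} = g(\theta_{t-1}, d_t) - g(\theta^*, d_t) - \bar g(\theta_{t-1})$ and $\xi_t^{(3)} = \bar g(\theta_{t-1}) - A\theta_{t-1} + b$. So the
SGD update (2) can be re-written as

$$\theta_t = \theta_{t-1} - \gamma_t\big(A\theta_{t-1} - b + \xi_t^{(1)} + \xi_t^{(2)} + \xi_t^{(3)}\big) \qquad (4)$$

It is easy to see that $\xi_t^{(1)}$ is a martingale difference with respect to $d_t$, i.e., $E(\xi_t^{(1)} \mid d_1, \cdots, d_{t-1}) = 0$, and has
identical distribution for different $t$. $\xi_t^{(2)}$ is also a martingale difference with respect to $d_t$. However, as we
will see in a later section, its magnitude depends on $\theta_{t-1} - \theta^*$. If $g(\theta, d)$ is smooth, we have
$\xi_t^{(2)} = O(\|\theta_{t-1} - \theta^*\|)$. For smooth $\bar g(\theta)$, we have $\xi_t^{(3)} = o(\|\theta_{t-1} - \theta^*\|)$. Both $\xi_t^{(2)}$ and $\xi_t^{(3)}$ are asymptotically
negligible if suitable conditions are met. We also note that $\xi_t^{(3)} = 0$ for quadratic $l(\theta, d)$.

By the above analysis, we first consider the following simple stochastic approximation procedure
which ignores $\xi_t^{(2)}$ and $\xi_t^{(3)}$:

$$\theta_t = \theta_{t-1} - \gamma_t(A\theta_{t-1} - b + \xi_t) \qquad (5)$$

$$\bar\theta_t = \frac{1}{t}\sum_{j=1}^{t}\theta_j \qquad (6)$$

where $A$ is a positive definite matrix with the smallest eigenvalue $\lambda_0$ and the largest eigenvalue $\lambda_1$,
$\xi_t$ is a martingale difference process, i.e., $E(\xi_t \mid \xi_1, \cdots, \xi_{t-1}) = 0$, and the variance of $\xi_t$ is $E(\xi_t\xi_t^T) = S$. We
will see that this algorithm can be used to find the root $\theta^*$ of the equation $A\theta = b$.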
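The procedure (5)-(6) can be simulated in a few lines. The sketch below is only an illustration under simplifying assumptions that are not part of the analysis: $A$ is taken to be diagonal, the noise $\xi_t$ is i.i.d. Gaussian with covariance `noise_std**2 * I`, and $\theta_0 = 0$. The names `averaged_sa` and `a_norm_sq` are hypothetical helpers.

```python
import random


def averaged_sa(A_diag, b, gamma0, a, c, steps, noise_std=1.0, seed=0):
    """Run update (5) and running average (6) for a diagonal matrix A.

    Returns (theta_t, theta_bar_t). Diagonal A and Gaussian noise are
    assumptions of this sketch only.
    """
    rng = random.Random(seed)
    dim = len(A_diag)
    theta = [0.0] * dim
    theta_bar = [0.0] * dim
    for t in range(1, steps + 1):
        gamma_t = gamma0 * (1.0 + a * gamma0 * t) ** (-c)
        for k in range(dim):
            xi = rng.gauss(0.0, noise_std)  # zero-mean noise xi_t
            # update (5): theta_t = theta_{t-1} - gamma_t (A theta_{t-1} - b + xi_t)
            theta[k] -= gamma_t * (A_diag[k] * theta[k] - b[k] + xi)
            # running average (6), updated incrementally
            theta_bar[k] += (theta[k] - theta_bar[k]) / t
    return theta, theta_bar


def a_norm_sq(v, A_diag):
    """Squared A-norm ||v||_A^2 = v^T A v for diagonal A."""
    return sum(lam * x * x for lam, x in zip(A_diag, v))


# Example: theta* = (1, 1) solves A theta = b for the values below.
A_diag, b = [1.0, 0.1], [1.0, 0.1]
theta, theta_bar = averaged_sa(A_diag, b, gamma0=1.0, a=0.1, c=2.0 / 3.0, steps=20000)
err = a_norm_sq([x - 1.0 for x in theta], A_diag)
err_bar = a_norm_sq([x - 1.0 for x in theta_bar], A_diag)
print("SGD error:", err, "averaged error:", err_bar)
```

Running it with different `a` and `c` gives a rough feel for how the schedule changes the number of steps needed before the averaged iterate overtakes the last iterate.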

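Theorem 1 If $\gamma_0\lambda_1 \le 1$ and $(2c - 1)a < \lambda_0$, then the estimator $\bar\theta_t$ in (6) satisfies:

$$t\,E(\|\bar\theta_t - \theta^*\|_A^2) \le \mathrm{tr}(A^{-1}S) + \frac{(2c_0 + c_0^2)(1 + a\gamma_0 t)^{c}}{a c \gamma_0 t}\,\mathrm{tr}(A^{-1}S) + \frac{(1 + c_0)^2}{\gamma_0^2 t}\|\Delta_0\|_{A^{-1}}^2$$

where

$$c_0 = \frac{ac(1 + ac\gamma_0)}{\lambda_0 - \max(0, 2c - 1)a}$$

The immediate conclusion from Theorem 1 is the asymptotic convergence bound of $\bar\theta_t$.

Corollary 2 $\bar\theta_t$ in (6) satisfies

$$t\,E(\|\bar\theta_t - \theta^*\|_A^2) \le \mathrm{tr}(A^{-1}S) + O(t^{-(1-c)})$$

The above bound is consistent with Theorem 1 in Polyak and Juditsky (1992) and is the best possible
asymptotic convergence rate that can be achieved by any algorithm (Fabian, 1973). However, we
are more interested in the non-asymptotic behavior of $\bar\theta_t$.

Corollary 3 If we choose $a = \lambda_0$, it takes $t = O((\lambda_0\gamma_0)^{-1})$ samples for $\bar\theta_t$ in (6) to reach the
asymptotic region. And at this point, $\bar\theta_t$ begins to become better than $\theta_t$.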

Proof Let t = T-^, we have 



i?(||A,||^)<ii±^||Ao||i 



Ao7o 



1 



(2co + c2)(l + X) 



c-l 



ii{A-^S) 



^2 i'--'- ' K 

On the other hand, the best possible convergence for 9t is obtained with a = Ag and c = 1: 



£;(||A 



t\\A, 



< 



lAolli 7otr(5) 
{1 + K)^ l + K 



(7) 



(8) 



We omit the proof of (8), which is similar to that of Theorem 1. A related (but not exactly the same)
result can be found in section 2.1 of Nemirovski et al. (2009). From (7) and (8) we can see that both
$\bar\theta_t$ and $\theta_t$ need $t = O((\lambda_0\gamma_0)^{-1})$ to reach their asymptotic region. However, at this point, $\bar\theta_t$ begins
to become better than $\theta_t$ because $\lambda_0\,\mathrm{tr}(A^{-1}S) \le \mathrm{tr}(S)$. ■



Corollary 4 It takes i = f2 f y- ) (Aq7o) ^ I samples for 9t in (0) to reach the asymptotic 
region. 

Proof In order for $\bar\theta_t$ to reach its asymptotic region, we need at least the second term of the right
hand side of the bound in Theorem 1 to be less than $\mathrm{tr}(A^{-1}S)$, which is to say

^coO + oTot)^ ^ ^ 



Hence 



.>^(?-)- = (?5) 



By Corollary 4, we should limit $a$ in order to have fast convergence. For the linear problem (5),
we should always use $a = 0$. If we use some arbitrary value such as 1 for $a$, $\bar\theta_t$ still has
asymptotically optimal convergence according to Polyak and Juditsky (1992), but it needs many more
samples to reach the asymptotic region in situations where $\lambda_0$ is very small. For the general SGD
update (4), we need to trade off against the convergence of $\xi_t^{(2)}$ and $\xi_t^{(3)}$. Hence $a$ should not be 0.
In general, $a$ should be a constant factor times $\lambda_0$.



3. Regression Problem 

In this section, we will analyze the convergence for regression problems. As we noted in Section 2,
the SGD update can be decomposed as (4), where $\xi_t^{(3)} = 0$ for the quadratic loss of linear regression.
As in the proof of Theorem 1, $\bar\Delta_t$ can be written as:

A. = -Lx* Ao + ] ± X]^^ + \ t X]^ = /(°) + /W + /(^) 

We already have a bound for $\|I^{(0)}\|_A$ and $\|I^{(1)}\|_A$ in Theorem 1. Now we work on $I^{(2)}$. We will
make two assumptions:



$$E\big(\|\xi_t^{(2)}\|_{A^{-1}}^2 \mid \Delta_{t-1}\big) \le c_1\|\Delta_{t-1}\|_A^2 \qquad (9)$$

$$\sum_{i=j}^{t} E\big(\|\Delta_i\|_A^2 \mid \Delta_{j-1}\big) \le c_2\|\Delta_{j-1}\|_A^2 + c_3\sum_{i=j}^{t}\gamma_i \qquad (10)$$

(9) is related to the continuity of $g(\theta, d)$ and the distribution of $y$. (10) is related to the convergence
of standard SGD. A bound similar to (10) can be found in section 3.1 of Hazan et al. (2006). Using
these assumptions, we can bound $E\|I^{(2)}\|_A^2$.

Lemma 5 With assumptions (9) and (10), we have

$$t\,E\|I^{(2)}\|_A^2 \le (1 + c_0)^2 c_1\left(\frac{1 + c_2}{t}\|\Delta_0\|_A^2 + \frac{c_3}{a(1 - c)t}(1 + a\gamma_0 t)^{1-c}\right) \qquad (11)$$

With the above lemma, we can obtain the following asymptotic convergence result:

Corollary 6 For quadratic loss, with assumptions (9) and (10), $\bar\theta_t$ satisfies

$$t\,E\|\bar\theta_t - \theta^*\|_A^2 \le \mathrm{tr}(A^{-1}S) + O(t^{-c/2}) + O(t^{-(1-c)})$$

Proof Note that

$$(E\|\bar\Delta_t\|_A^2)^{1/2} \le (E\|I^{(0)}\|_A^2)^{1/2} + (E\|I^{(1)}\|_A^2)^{1/2} + (E\|I^{(2)}\|_A^2)^{1/2}$$

The corollary follows by applying (16), (17) and Lemma 5. ■

The best convergence rate is obtained when $c = 2/3$. Now we take a close look at the constant factor
$c_1$ in assumption (9) to have a better understanding of the non-asymptotic behavior of $E\|I^{(2)}\|_A^2$.

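Lemma 7 For ridge regression $l(\theta, d) = \frac{1}{2}(\theta^T x - y)^2$, if $\|x\|^2 \le M$, then

$$E\big(\|\xi_t^{(2)}\|_{A^{-1}}^2 \mid \Delta_{t-1}\big) \le \frac{M}{\lambda_0}\|\Delta_{t-1}\|_A^2$$

Assuming $\|x\|^2 = M$, Lemma 12 in the Appendix shows that $\|\Delta_t\|^2$ will diverge if the learning rate
is greater than $\frac{2}{M}$. So $\gamma_0 \le \frac{2}{M}$ and $c_1 \le \frac{M}{\lambda_0}$. Plugging these bounds for $c_1$ and $\gamma_0$ into Lemma 5, we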



have the following for t — ^ 



Ao7o ' 

C370 



i^n/^-^iii<2(i+co)-( (^+^-^y^°"^ +(^ 



c)K{l + KY 



II A II 

Note that the best possible SGD error bound is J' "^^A + f4^ with a = \q and c = 1. We see 
that $E\|I^{(2)}\|_A^2$ is negligible compared to the error of SGD if $t > O((\lambda_0\gamma_0)^{-1})$. Together with the
analysis in Section 2, we conclude that ASGD begins to outperform SGD after $t > O((\lambda_0\gamma_0)^{-1})$.
The conclusion we draw in this section applies not only to the case of y with constant norm. A similar
conclusion can be drawn if y is normally distributed or if each dimension of y is independently
distributed, and/or if L2 regularization is used.

Based on the above analysis, for linear regression problems, we propose to use the following values
in (3) to calculate the learning rate: $\gamma_0 = 1/M$, $a = \lambda_0$, $c = 2/3$. We will see in the next
section that, for general non-quadratic loss, the optimal $c$ is different since we need to further consider the
convergence of $\xi_t^{(3)}$.
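The recipe for linear regression, $\gamma_0 = 1/M$, $a = \lambda_0$, $c = 2/3$ in (3), can be sketched as follows. This is a minimal pure-Python illustration, not the implementation used in the experiments: the function name `asgd_linear_regression`, the dense list representation, the choice $\theta_0 = 0$, and the synthetic noiseless data with $\pm 1$ features (so that $\|x\|^2 = M$ exactly and $E(xx^T) = I$) are all assumptions made for the example.

```python
import random


def asgd_linear_regression(samples, dim, M, lambda0, c=2.0 / 3.0):
    """One pass of ASGD for the loss 0.5 * (theta^T x - y)^2.

    Uses gamma_t = gamma0 * (1 + a * gamma0 * t) ** (-c) with
    gamma0 = 1 / M and a = lambda0. Returns (theta_t, theta_bar_t).
    """
    gamma0 = 1.0 / M
    theta = [0.0] * dim
    theta_bar = [0.0] * dim
    for t, (x, y) in enumerate(samples, start=1):
        gamma_t = gamma0 * (1.0 + lambda0 * gamma0 * t) ** (-c)
        residual = sum(w * xi for w, xi in zip(theta, x)) - y
        # SGD step on the squared loss: gradient is residual * x
        theta = [w - gamma_t * residual * xi for w, xi in zip(theta, x)]
        # running average of theta_1, ..., theta_t
        theta_bar = [wb + (w - wb) / t for wb, w in zip(theta_bar, theta)]
    return theta, theta_bar


# Hypothetical noiseless data: features in {-1, +1}^2, so M = 2 and lambda0 = 1.
rng = random.Random(1)
target = [1.0, -2.0]
data = []
for _ in range(2000):
    x = [rng.choice((-1.0, 1.0)) for _ in range(2)]
    data.append((x, sum(w * xi for w, xi in zip(target, x))))

theta, theta_bar = asgd_linear_regression(data, dim=2, M=2.0, lambda0=1.0)
print("last iterate:", theta)
print("averaged iterate:", theta_bar)
```

In practice $M$ and $\lambda_0$ are not known exactly; Section 6.3 estimates $M$ from a sample of the data and uses the regularization coefficient as a lower bound for $\lambda_0$.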

4. Non-quadratic loss 

For non-quadratic loss, we need to analyze the contribution of $\xi_t^{(3)}$ to the error. We need the following
two additional assumptions:



E[uf^\\A-^\0,-i)<c40,-i- 0*111 (12) 

t t 

^i?(||A,||^)<C5||Ao||Uc6E7t (13) 



Similar to ©, (HU is related to the continuity of g{9, d) and the distribution of x and y. Similar to 
i s relat ed to the convergence of standard SGD. We note that the asymptotic normality of 



9t (jFabianl . 119681) suggests that assumption (IT^ is reasonable. 



Lemma 8 With Assumption ^ MOjl WJ(l and U3\) , we have 

tE\\li^)\\\ < ^^ + ^"^''=' ((1 + 2c2)c5||Ao||i + (2C2C3II Aolli + (1 + 2c^)ceH + 4hlf) 

where 7J = E*s=i7s- 

Corollary 9 For non-quadratic loss, with assumptions (9), (10), (12) and (13), if $c > \frac{1}{2}$, then $\bar\theta_t$
satisfies

$$t\,E\|\bar\theta_t - \theta^*\|_A^2 \le \mathrm{tr}(A^{-1}S) + O(t^{-(c-1/2)}) + O(t^{-(1-c)})$$

Proof Note that

$$(E\|\bar\Delta_t\|_A^2)^{1/2} \le (E\|I^{(0)}\|_A^2)^{1/2} + (E\|I^{(1)}\|_A^2)^{1/2} + (E\|I^{(2)}\|_A^2)^{1/2} + (E\|I^{(3)}\|_A^2)^{1/2}$$

The corollary follows by applying (16), (17), Lemma 5 and Lemma 8. ■

The best convergence rate is obtained when $c = 3/4$, which is different from that for quadratic loss.
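The two optimal values of $c$ follow from balancing the remainder terms in the corresponding corollaries. As a short worked step, assuming the remainders are $O(t^{-c/2}) + O(t^{-(1-c)})$ for quadratic loss (Corollary 6) and $O(t^{-(c-1/2)}) + O(t^{-(1-c)})$ for non-quadratic loss (Corollary 9), the slower of the two terms is fastest when the exponents are equal:

```latex
\begin{align*}
\text{quadratic loss:} \quad & \tfrac{c}{2} = 1 - c
  \;\Longrightarrow\; c = \tfrac{2}{3},
  \quad \text{remainder } O(t^{-1/3}), \\
\text{non-quadratic loss:} \quad & c - \tfrac{1}{2} = 1 - c
  \;\Longrightarrow\; c = \tfrac{3}{4},
  \quad \text{remainder } O(t^{-1/4}).
\end{align*}
```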

5. Implementation 

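In this section, we discuss how we implement ASGD for linear models $f_\theta(x) = \theta^T x$ with L2
regularization. The running average can be recursively updated by $\bar\theta_t = (1 - \frac{1}{t})\bar\theta_{t-1} + \frac{1}{t}\theta_t$, which is very
easy to implement. However, for sparse data sets, this can be very costly compared to SGD since $\bar\theta_t$
is typically a dense vector. Consider the following averaging procedure:

$$\theta_t = (1 - \lambda\gamma_t)\theta_{t-1} - \gamma_t g_t, \qquad \bar\theta_t = (1 - \eta_t)\bar\theta_{t-1} + \eta_t\theta_t$$

where $\lambda$ is the L2 regularization coefficient, $g_t = L'_s(\theta_{t-1}^T x_t, y_t)\,x_t$ is the gradient of the loss term
($L'_s$ denotes the derivative of $L$ with respect to its first argument), and $\eta_t$ is the rate
of averaging. Hence $g_t$ is sparse when $x_t$ is sparse. We want to take advantage of the sparsity of
$x_t$ for updating $\theta_t$ and $\bar\theta_t$. Let

$$\alpha_t = \frac{1}{\prod_{i=1}^{t}(1 - \lambda\gamma_i)}, \qquad \beta_t = \frac{1}{\prod_{i=1}^{t}(1 - \eta_i)}, \qquad u_t = \alpha_t\theta_t, \qquad \bar u_t = \beta_t\bar\theta_t$$

After some manipulation, we get the following:

$$u_t = u_{t-1} - \alpha_t\gamma_t g_t$$

$$\bar u_t = \bar u_{t-1} + \beta_t\eta_t\theta_t = \bar u_0 + \sum_{i=1}^{t}\frac{\beta_i\eta_i}{\alpha_i}u_i$$

Now define $\tau_t = \sum_{i=1}^{t}\frac{\beta_i\eta_i}{\alpha_i}$ and $\hat u_t = \hat u_{t-1} + \tau_{t-1}\alpha_t\gamma_t g_t$ with $\hat u_0 = \bar u_0$. We get

$$\bar u_t = \bar u_0 + \tau_t u_t + \sum_{j=1}^{t}\tau_{j-1}\alpha_j\gamma_j g_j = \tau_t u_t + \hat u_t$$

Hence we obtain the following efficient algorithm for updating $\bar\theta_t$:

Algorithm 1 Sparse ASGD

  $\alpha_0 = 1$, $\beta_0 = 1$, $\tau_0 = 0$, $u_0 = \theta_0$, $\hat u_0 = \theta_0$
  while $t \le T$ do
    $g_t = L'_s(\frac{1}{\alpha_{t-1}} u_{t-1}^T x_t, y_t)\,x_t$
    $\alpha_t = \frac{\alpha_{t-1}}{1 - \lambda\gamma_t}$
    $\beta_t = \frac{\beta_{t-1}}{1 - \eta_t}$
    $u_t = u_{t-1} - \alpha_t\gamma_t g_t$
    $\hat u_t = \hat u_{t-1} + \tau_{t-1}\alpha_t\gamma_t g_t$
    $\tau_t = \tau_{t-1} + \frac{\eta_t\beta_t}{\alpha_t}$
  end while

At any step of the algorithm, $\bar\theta_t$ can be obtained by $\bar\theta_t = \frac{\bar u_t}{\beta_t} = \frac{\tau_t u_t + \hat u_t}{\beta_t}$. Note that in Algorithm
1 none of the operations involves two dense vectors. Thus the number of operations per sample is
$O(Z)$, where $Z$ is the number of non-zero elements in $x$.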
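The recursion of Algorithm 1 can be checked numerically against the straightforward dense updates of $\theta_t$ and $\bar\theta_t$. The sketch below is an illustration under stated assumptions, not the implementation used in the experiments: samples are `(x, y)` pairs with `x` a dict from feature index to value, $\theta_0 = 0$, and the schedules `gamma` and `eta` are passed in as callables. The names `sparse_asgd`, `dense_asgd` and `squared_hinge_grad` are hypothetical. Since $\alpha_t$ grows without bound as $t$ increases, a long-running implementation would presumably need to rescale the auxiliary variables periodically; that step is omitted here.

```python
def squared_hinge_grad(s, y):
    """Derivative in s of the loss 0.5 * max(0, 1 - y*s)^2."""
    return -y * max(0.0, 1.0 - y * s)


def sparse_asgd(samples, dim, lam, gamma, eta, loss_grad):
    """Algorithm 1: only the non-zero coordinates of x are touched per step."""
    alpha, beta, tau = 1.0, 1.0, 0.0
    u = [0.0] * dim      # u_t = alpha_t * theta_t
    u_hat = [0.0] * dim  # auxiliary vector with u_bar_t = tau_t * u_t + u_hat_t
    for t, (x, y) in enumerate(samples, start=1):
        g_t, e_t = gamma(t), eta(t)
        s = sum(u[k] * v for k, v in x.items()) / alpha  # theta_{t-1}^T x_t
        d = loss_grad(s, y)
        alpha /= (1.0 - lam * g_t)
        beta /= (1.0 - e_t)
        step = alpha * g_t * d
        for k, v in x.items():
            u[k] -= step * v
            u_hat[k] += tau * step * v  # tau still holds tau_{t-1} here
        tau += e_t * beta / alpha
    return [(tau * uk + uh) / beta for uk, uh in zip(u, u_hat)]


def dense_asgd(samples, dim, lam, gamma, eta, loss_grad):
    """Reference implementation with dense O(dim) updates per step."""
    theta = [0.0] * dim
    theta_bar = [0.0] * dim
    for t, (x, y) in enumerate(samples, start=1):
        g_t, e_t = gamma(t), eta(t)
        d = loss_grad(sum(theta[k] * v for k, v in x.items()), y)
        theta = [(1.0 - lam * g_t) * w for w in theta]
        for k, v in x.items():
            theta[k] -= g_t * d * v
        theta_bar = [(1.0 - e_t) * wb + e_t * w for wb, w in zip(theta_bar, theta)]
    return theta_bar


# Tiny hypothetical data set with sparse feature dicts.
toy = [({0: 1.0, 2: -0.5}, 1.0), ({1: 2.0}, -1.0),
       ({0: -1.0, 3: 1.5}, -1.0), ({2: 1.0, 3: 1.0}, 1.0)] * 10
gamma = lambda t: 0.2 * (1.0 + 0.01 * 0.2 * t) ** (-0.75)
eta = lambda t: 1.0 / (t + 1)  # keeps 1 - eta_t > 0 at every step
sparse_result = sparse_asgd(toy, 4, 0.01, gamma, eta, squared_hinge_grad)
dense_result = dense_asgd(toy, 4, 0.01, gamma, eta, squared_hinge_grad)
print(max(abs(p - q) for p, q in zip(sparse_result, dense_result)))
```

The averaging rate $\eta_t = 1/(t+1)$ is chosen here so that $1 - \eta_t$ never vanishes; with $\eta_1 = 1$ the update of $\beta_t$ would divide by zero.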

From Theorem 1 we can see that if $\|\Delta_0\|_{A^{-1}}^2$ is large compared to $\mathrm{tr}(A^{-1}S)$, then the error is
dominated by $I^{(0)}$ at the beginning. This can happen if the noise is small compared to $\|\Delta_0\|$. It is
possible to further improve the performance of ASGD by discarding $\theta_t$ from averaging during the
initial period of training. We want to find a point $t_0$ after which averaging becomes beneficial. For
this, we maintain an exponential moving average $\tilde\theta_t = 0.99\,\tilde\theta_{t-1} + 0.01\,\theta_t$ and compare the moving
average of the empirical loss of $\tilde\theta_t$ and $\theta_t$. Once $\tilde\theta_t$ is better than $\theta_t$, we begin the ASGD procedure.



6. Experiments 

In this section, we provide 3 sets of experiments. The first experiment illustrates the importance
of learning rate scheduling for ASGD. The second experiment illustrates the asymptotically optimal
convergence of ASGD. In the third set of experiments, we apply ASGD to many public benchmark
data sets and compare it with several state of the art algorithms.

6.1 Effect of learning rate scheduling 

Our first experiment shows how different learning rate schedules affect the convergence of
ASGD on a synthetic problem. The exemplar optimization problem is $\min_\theta E_x\big((\theta - x)^T A(\theta - x)\big)$,
where $A$ is a symmetric $100 \times 100$ matrix with eigenvalues $[1, 1, 1, 0.02, \cdots, 0.02]$ and $x$ follows a normal
distribution with zero mean and unit covariance. It can be shown that the optimum is $\theta^* = 0$.
Figure 1 shows the excess risk $\mathcal{L}(\theta_t) - \mathcal{L}(\theta^*)$ of the solution vs. the number of training samples $t$. We
note that in this particular example the excess risk is simply $\theta_t^T A\theta_t$. For the good example of ASGD
(ASGD in the figure), we use our proposed learning rate schedule $\gamma_t = (1 + 0.02t)^{-2/3}$ according to
Section 3. For a bad example of ASGD (ASGD_BAD in the figure), we use $\gamma_t = (1 + t)^{-2/3}$, which
looks simple and also has optimal asymptotic convergence according to Corollary 2. Figure 1 also
shows the performance of standard SGD using the learning rate schedule $\gamma_t = (1 + 0.02t)^{-1}$ and the batch
method $\theta_t = \frac{1}{t}\sum_{i=1}^{t} x_i$. We see that both ASGD and ASGD_BAD eventually outperform SGD
and come close to the batch method. However, it takes only a few thousand examples for ASGD
to get to the asymptotic region, while it takes hundreds of thousands of examples for ASGD_BAD.
This huge difference illustrates the significant role of learning rate scheduling for ASGD.



Figure 1: ASGD with proposed learning rate schedule (ASGD) and an arbitrarily chosen learning
rate schedule (ASGD_BAD). [Plot: excess risk vs. training size t; curves: SGD, ASGD, Batch, ASGD_BAD.]



6.2 Asymptotic optimal convergence 

Our second experiment shows the asymptotic optimality of ASGD for linear regression. For
this purpose, we generate a synthetic regression problem $y = x^T\theta^* + \epsilon$, where $x$ is an $N = 100$ dimensional
vector following a Gaussian distribution with zero mean and covariance $A$, the eigenvalues of $A$ are
evenly spread from 0.01 to 1, $\theta^*$ is a vector with all dimensions equal to 1, and $\epsilon$ follows a Gaussian
distribution with zero mean and unit variance. We compare ASGD with SGD and the batch method.
We use $\gamma_0 = 1/\mathrm{tr}(A)$ for both ASGD and SGD. For the batch method, we simply calculate $\theta_t$ as
$\theta_t = \big(\sum_{i=1}^{t} x_i x_i^T\big)^{-1}\sum_{i=1}^{t} x_i y_i$. Figure 2 shows the excess risk $\mathcal{L}(\theta_t) - \mathcal{L}(\theta^*)$ of the solution vs.



number of training samples t. As the figure shows, after about 10^ examples, the accuracy of ASGD 
starts to be close to the batch solution while the solution of SGD remains more than 10 times worse than
ASGD. Note that although ASGD and the batch solution have similar accuracy, ASGD is considerably
faster than the batch method since ASGD only needs $O(N)$ computation per sample while the batch method
needs $O(N^2)$ computation per sample.




Figure 2: Compare ASGD with batch method. [Plot: excess risk vs. training size t.]



6.3 Experiments on benchmark data sets 

In the third set of experiments, we compare ASGD with several other algorithms for training large
scale linear models: online limited-memory BFGS (oLBFGS) of Schraudolph et al. (2007), stochastic
gradient descent (SGD2) of Bottou (2007), dual coordinate descent (LIBLINEAR) of Fan et al.
(2008), Pegasos of Shalev-Shwartz et al. (2007) and SGDQN of Bordes et al. (2009). We performed
an extensive evaluation of ASGD on many data sets. Due to the space limit, we only show detailed results on
four tasks in this paper. COVTYPE is the detection of class 2 among 7 forest cover types (Blackard
et al). All dimensions are normalized between 0 and 1. DELTA is a synthetic data set from the
PASCAL Large Scale Challenge (Sonnenburg et al., 2008). We use the default data preprocessing
provided by the challenge organizers. RCV1 is the classification of documents belonging to class
CCAT in the RCV1 text data set (Lewis et al., 2004). We use the same preprocessing as provided
in Bottou (2007). MNIST9 is the classification of digit 9 against all other digits in the MNIST digit
image data set (LeCun et al., 1998). For this task, we generate our own image feature vectors for
recognition. The experiments for these four tasks use the squared hinge loss $L(s, y) = \frac{1}{2}(\max(0, 1 - ys))^2$
with L2 regularization $R(\theta) = \frac{\lambda}{2}\|\theta\|^2$. Since $\lambda_0$ is unknown, we use the regularization coefficient $\lambda$ as
$\lambda_0$, which is a lower bound for the true $\lambda_0$. Table 1 summarizes the data sets, where $M$ is the $\max\|x\|^2$
calculated from 1000 samples and $t_0$ is the point where averaging begins (see Section 5). Figure 3 shows
the test error rate (left), elapsed time (middle) and test cost (right) at different points within the first
two passes of the training data.

We also include more experimental results on data sets from the Pascal Large Scale Challenge.
However, to save space, we only show figures for the test error rate. All experiments use the default
data preprocessing provided by the challenge organizers. Table 2 summarizes the data sets. Figure 4
and Figure 5 show results for L2 SVM, logistic regression and SVM. LIBLINEAR is not included in
the figures for logistic regression because the dual coordinate descent method used by LIBLINEAR
cannot solve logistic regression. Although the theory of ASGD only applies to smooth cost functions,
we also include the results for SVM to satisfy the possible curiosity of some readers.



As we can see from the figures, ASGD clearly outperforms all the other 5 algorithms in terms of accuracy
on most of the data sets. In fact, for most of the data sets, ASGD reaches good performance with
only one pass of data, while many other algorithms still perform poorly at that point. The only
exception is the beta data set, where all methods perform equally badly because the two classes in
this data set are not linearly separable. Moreover, the performance of the other 5 methods tends
to be more volatile, while the performance of ASGD is more robust due to averaging. In terms of time
spent on one pass of data, ASGD is similar to the other methods except oLBFGS, which means that
ASGD needs less time to reach similar test performance compared to the other methods. Another
interesting point is that although the current theory of ASGD is based on the assumption that the cost
function is smooth, as shown in the figures, ASGD also works well with non-smooth losses such
as the hinge loss.



Table 1: Data Set Summary

data set  description           type    dim    train size  test size  λ      M           t_0
covtype   forest cover type     sparse  54     500k        81k        10-^   6.8         100
delta     synthetic data        dense   500    400k        50k        10-2   3.8 X 10^   100
rcv1      text data             sparse  47153  781k        23k        10-^   1           781
mnist9    digit image features  dense   2304   50k         10k        10-^^  2.1 X lO''  128







Table 2: Data Set Summary

data set  description      type    dim   train size  test size  λ      M
alpha     synthetic data   dense   500   400k        50k        lO-''  1
beta      synthetic data   dense   500   400k        50k        10-'*  1
gamma     synthetic data   dense   500   400k        50k        10-3   2.5 X 10-3
epsilon   synthetic data   dense   2000  400k        50k        10-5   1
zeta      synthetic data   dense   2000  400k        50k        10-5   1
fd        character image  dense   900   1000k       470k       10-5   1
ocr       character image  dense   1156  1000k       500k       10-5   1
dna       DNA sequence     sparse  800   1000k       1000k      10-3   200




7. Conclusion 

ASGD is relatively easy to implement compared to other algorithms. As demonstrated on
both synthetic and real data sets, with our proposed learning rate schedule, ASGD performs better
than other more complicated algorithms for large scale learning problems. In this paper, we only
apply ASGD to linear models with convex loss, which have a unique local optimum. It would be
interesting to see how ASGD can be applied to more complicated models such as conditional random
fields (CRF) or models with multiple local optima such as neural networks.

Figure 3: Left: Test error (%) vs. number of passes. Middle: Test error vs. training time. Right:
Test cost vs. number of passes.

Appendix A. Proofs

Lemma 10 Lei k = 1 — max(0, 2c — 1)^. If JqXi < 1, then 

V7fe+1 Ik J 7fe+l V7*: Ik-lJ Ik 

Proof For < c < 0.5, let f{x) = [x'^ — (x — l)^)x'^, where x = fc H — —. We only need to show 

n^) < 

f'{x) = 2cx2=-i - c{x - ly-^x" - c{x - Ifx''^^ 
= 2cx''-\x''-{x-lY-]-{x-lY-^) 

< 2cx''-\ix - ly + c{x - ly-^ -{x- ly - hx - ly-^) 
= c{2c - i)x''-\x - ly-^ < 

where we used the fact x'^ < {x — ly + c{x — ly^^ for < c < 1. 

For c > 0.5, let f{x) — log((x'^ ^ {^ ^ l)'^)^;^), where x — k + ^. We only need to show 

fix + 1) - fix) + ^i2£_3. iog(l _ Ao7o(a7o.T)-^) < 
By mean value theorem, there exists some y'.x<y<x-{-l s.t. f{x + 1) — fix) = fiy). Hence 

fix + 1) - fix) + ^12^_11 iog((l _ Ao7o(«7o2:)-^) 

< fiy) - a(2c- l)7o(a7o.T)-^ < fiy) - (2c - l)(a7o)i-=y-= 
2ciy- -iy-iy-liy- 1)^-1) (2c - l)ia-foy)'-- 

yiy" - iy - '^)'') y 



< 



2ciy--iy-\y-\iy-l)-^) 2c - I 

yiy" - {y - '^Y) v 

y^^iy- ly - ciy - l)^-^ 



yiy"- iy- i)'') 



<0 



The following is a key lemma which is used several times in this paper. 
Lemma 11 Let $X_j^t$ and $\bar X_j^t$ be

$$X_j^t = \prod_{i=j}^{t}(I - \gamma_i A), \qquad X_j^t = I \ \text{for } j > t, \qquad \bar X_j^t = \sum_{i=j}^{t}\gamma_j X_{j+1}^{i}$$

If $\gamma_0\lambda_1 \le 1$ and $(2c - 1)a < \lambda_0$, then we have the following bound for $\bar X_j^t$:

$$(I - X_j^t)A^{-1} \le \bar X_j^t \le \big(1 + c_0(1 + a\gamma_0 j)^{c-1}\big)A^{-1} \le (1 + c_0)A^{-1}$$

where $c_0$ is the same as in Theorem 1.
Proof It is easy to verify the following relation by induction on t, 



Y,^,X^'^{I-Xf)^^ (14) 



Now we calculate the difference between Xj and X]i=i 7*^j^^' 

t t t 






t 



It is clear that from the first line of above equation that X* — Xli=, Ji^j~^ > 0- Hence we obtain 
the first inequality of the lemma. We have 

{i-Xoiky'i<{i-"tkA)-^ 

By Lemma fTOj we have 

\lk+l Ik J Ik+l \lk Ik-lJ Jk 

Hence 

7fc Ik-iJ Ik ■'^ Xlj+i IjJ 7j+i ■' 
Define F/ as F/ = Ilt=jil - i^liA)- Since < k < 1, we have (X^)'^ < F/. Hence 

^^* ^ ^J+1 k=j+l 

= -(^-l)A-^X^^^, + :il^^A-'iI-YU,) 
< J^l^^l2±lA-^ = — ((1 + «7oO- + 1))^ - (1 + a7oj))^)A-'
K7i I3I3+1 «7i 

^ ac7o(l + a7oj)^~^ ^^_2 ^ ac7o(l + 070^) ^'^ A-^ 



K-fi 



K-fi 



An 



= co{l + a-fojr-^A-^ 

Now plugging (|14l) into above inequality, we obtain the claim of the lemma. 

With Lemma [TTl we can now prove Theorem [TJ 
Proof (Theorem [T|) From (IS|), we get 

1 ' 

At ^ At-i ~ jt{AAt-i + (t) , At = -^A, 

From (ITSt. we have 






then 



t J 



t k 



At 



7EA.=7En(^-7.^)Ao+7E E n(^-7<^) 7.< 



j=i 



]=i i=i 



j = l \k=ji=j + l 



;i^(X*-7o/)Ao + i5:x;e,=/(°)+/« 



J=l 



where X,' is defined in Lemma [TTl Hence 



(15) 



tEi\\I^''^\\\) = 4tAJA(X* - 7o/)^Ao < ii±^AjA-iAo 
70*- 7o '- 



(16) 



^i?(ii/'^^iii) = 7 E i?(ej^(^j)^e,) < 7 E(i + ^0(1 + a^,jr-'fEi^TA-%) 



j=i 



i=i 



< 1 



< 1 



2co + cl 



t 



E(l + «Wr-Mtr(A-i5)< f 






(2co + cg)((l + a7ot)^-f) 
ac^ot 



{2co + 4){l + a-foty 



tr{A-^S) 



And we have ^((/(o))^^/(i)) = since E{Q = 0. 



Proof (Lemma [5]) 



tT{A^^S) 
(17) 



t£;|i/(2)||3j = t£; 



1 * 



tc(2) 



-7E^II^X 



tc(2)||2 



J = l 



7E^(^f "^(^D'^f ) ^ 7E(i + ^o)^i?(er^-^e 



(2)T^_l^(2)^ 
(l + co)2ci



t-1 



(l+C2)||Ao||^+C3^7j- 



t 



(l + C2)||Ao||l 



2 ^ C3((l + a7oi)'^'-l^ 



a(l - c) 
< (1 + co)^c, (i±^|| Aoll^ + 1^(1 + «^"^)"^) 



Proof (Lcmnia[7]) Let J^x = E{xx^). We have the foUowing: 

m = E{g{e,d)) = 11J ^ E{xy) 

A^Y.^ , 6 = E{xy) , r = A-^b 

^(2) = g(0, d) ~ g{9*,d) - m = {xx^ - S,)(0 - r ) 

By the assumption of this lemma, we get 



E{xx'^A~'^xx'^) < —E{xx'^xx^) < —A 
Ao Aq 



From (UHl) and ([TH), we get 



e{u^'^\U\o) 



< 



Ao 



r\\\ 



(18) 



(19) 



Lemma 12 For the linear regression problem $l(\theta, x, y) = \frac{1}{2}(\theta^T x - y)^2$, assuming all $\|x\|^2$ are $M$, then
(2) will diverge if the learning rate is greater than $\frac{2}{M}$.

Proof Let Xf be defined as in Lemma [TT] We obtain the following from ([2]), 



At = (/ - ^tXtxf)At-i - it{xtx] 



xtyt) 



Let At — XfxJ , bt = xtyt, A = E{At), b = E{bt). Taking expectation with respect to Xt, yt, noticing 
that Ad* — b, we get 

E{At\et^i) = (/-7tA)At_i 
E{\\Atf\At^,) = Af_,E{I-2^tA + ^fAtAt)At_i 

+^^E{\\Ate* ~ btf) + 2^^E{9*^AtAt - bjAt)At-i 
= ||At_i||2 - {2jt - Mjf)\\At-i\\\ + jMS) + Hu'^t-i 

where 5 = E{{Ate* - bt){Ate* - bt)^), u = E{AtAte* - Atbt). Hence 

E{^A,f) ^ ii;(||At-i||2) - (27t - M7t')£;(||At_i||^) + 7^r(^) + 2f,u^X\-^A^ 

If^t >=i|+'5> if, then 

£;(||At||2) > ii;(||At_i||2) + ^(2 + JM)£;(||At„i||^) +72tr(5) + 2^lu^ X{-^ A^ 
> (1 + Ao(5(2 + 8M))E[\At-if) + 7'tr(^) + 272u^X*-1Ao 
Noticing that $X_1^{t-1} \to 0$ as $t \to \infty$, we conclude that $E(\|\Delta_t\|^2)$ is diverging if $\gamma_t > \frac{2}{M}$.



Proof (Lemma E]) Let 7* = Y,j=i'yj^ 



tEwi^^m < it^mfiiA+jtt ^(^r^'.^^i^i^') 



t 



j=i k=j+i 



< i^(l + cofiJ||e( 



2p|lt(3)||2 

A- 



< 



< 



< 



< 



< 



< 



< 



J = l 



(l + co)'cl 



(l + co)'cl 



il + co)'cl 



{l + co)'cl 



(l + co)2cl 



(l + co)^cl 



j = l :; = 1 k=j+l 

^ii;||A,||U2E^(llMlA E ^(IIa.IIaI^,) 



t 






fc=J+l 



^£;||A,||U2E^ l!M!ih2llA,||Uc3 E ^'^ 



i 



k=j+l 



{1 + 2c,)J2e\\A,\\\ + c,jI) + 2csJ2e{\\A,\\\) Y, 7fc 



j=i 






fe=j + l 



(l + 2c2)(c5||Ao||i + C67*) + 2c3E7fcE^(ll^:'-|lA) 

fc=2 J = l 

(l + 2c2)(c5||Ao||i + C67D + 2c3E7fc(c2||Ao||^ + C37f-i) 



fc=2 



^i±|^ ((1 + 2C2)(C5|| Aolli + C67l) + 2C2C3II Ao||i7l + cMf) 
^^^^°^'^' ((1 + 2c2)c5|lAo||i + (2c2C3|lAo|li + (1 + 2c2)cehl + cK^lf 

Figure 4: Test error (%) vs. number of passes. Left: L2SVM; Middle: logistic regression; Right:
SVM.

Figure 5: Test error (%) vs. number of passes. Left: L2SVM; Middle: logistic regression; Right:
SVM.